A Sentiment Analysis of TED Talks

THE DATA:

The data used in this analysis was found on Kaggle (https://www.kaggle.com/rounakbanik/ted-talks), where it is provided as two CSV files. The data was originally scraped from the official TED website.

The context for the data, as provided on Kaggle, is as follows:

“These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. The TED main dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles. The TED transcripts dataset contains the transcripts for all talks available on TED.com.”

In the following analysis, we examine word choice and overall sentiment among TED Talks with respect to other variables provided in the data.

Jess’ Analysis

One of the variables provided in the data set is ‘ratings’, which lists the ratings that viewers gave each talk. For each talk, every rating has an associated number: the count of votes that rating received. I define the max rating of a talk as the rating that received the most votes. Before looking at sentiment and word choice in the talks, I will look at the distribution of max ratings among all TED Talks.
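The max rating definition above can be sketched in a few lines of Python. This is an illustration only, not the code behind the plots. It assumes the ‘ratings’ field arrives as a stringified list of records with ‘name’ and ‘count’ keys; the example row and the function name max_rating are made up.

```python
import ast

def max_rating(ratings_field):
    """Return the name of the rating that received the most votes for one talk."""
    # Assumption: the field is a string such as "[{'name': 'Funny', 'count': 3}, ...]".
    ratings = ast.literal_eval(ratings_field)
    # max() keeps the first record on ties, so tied talks get the earlier-listed rating.
    return max(ratings, key=lambda r: r["count"])["name"]

# Hypothetical example row with made-up vote counts.
example = (
    "[{'id': 7, 'name': 'Funny', 'count': 120}, "
    "{'id': 8, 'name': 'Informative', 'count': 340}, "
    "{'id': 3, 'name': 'Courageous', 'count': 45}]"
)
print(max_rating(example))
```

Ties are broken here by list order, which is a choice of this sketch; the analysis does not say how ties were handled.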

Before performing any sentiment analysis, I also want to look at the overall distribution of words used across all TED Talks. In the plot below, we see that the word ‘like’ has approximately double the frequency of the next most common word. In colloquial English it is largely a filler word with no clear positive or negative connotation, so we will exclude ‘like’ from further plots and analyses. We will also remove the word “right”, the third most common word used in TED Talks, because its sentiment is not obvious out of context: it could be a positive word meaning “correct”, but it could also refer to a direction, which carries no sentiment.

The distribution of the top 50 words used in TED Talks after removing “like” and “right”
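As a rough illustration of the word-frequency step, the sketch below counts words across transcripts, drops stop words, and then drops the two excluded words. The stop-word list and the transcripts are tiny made-up stand-ins, and word_counts is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in stop-word list, not the one used in the analysis.
STOP_WORDS = {"the", "a", "and", "is", "it", "to", "of", "i", "that", "this"}
EXCLUDED = {"like", "right"}  # the two words removed in the analysis above

def word_counts(transcripts, excluded=frozenset()):
    """Count lower-cased words across transcripts, skipping stop and excluded words."""
    counts = Counter()
    for text in transcripts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and t not in excluded)
    return counts

# Made-up transcripts for illustration.
transcripts = [
    "It is like a problem, right? Well, I like that.",
    "This is the right work and it is, like, good work.",
]
before = word_counts(transcripts)
after = word_counts(transcripts, EXCLUDED)
print(before.most_common(3))
print(after.most_common(3))
```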

Next, I will use the nrc lexicon to look at the distribution of sentiments, as determined by word choice, among TED Talks when grouped by their max rating. The frequency is presented as a proportion here because, as seen above, the number of talks within each max rating group is not uniform.

When looking at the plots below, there do not appear to be any large differences in the sentiment distribution among TED Talks based on their max rating.
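The reason for using proportions rather than raw counts can be seen in a small sketch. The lexicon below is a made-up stand-in for nrc (the real lexicon maps words to one or more sentiment categories), and the word lists are invented; sentiment_proportions is a hypothetical helper.

```python
from collections import Counter

# Assumption: a tiny made-up stand-in for the nrc lexicon (word -> sentiment categories).
TOY_NRC = {
    "fear": ["fear", "negative"],
    "love": ["joy", "positive"],
    "death": ["sadness", "negative", "fear"],
    "trust": ["trust", "positive"],
}

def sentiment_proportions(words_by_group, lexicon):
    """For each group, return each category's share of that group's sentiment tags."""
    result = {}
    for group, words in words_by_group.items():
        tags = Counter(tag for w in words for tag in lexicon.get(w, []))
        total = sum(tags.values())
        result[group] = {tag: n / total for tag, n in tags.items()} if total else {}
    return result

# Made-up word lists; the groups differ in size, as the max rating groups do.
words_by_group = {
    "Courageous": ["fear", "death", "love", "story"],
    "Inspiring": ["love", "trust", "love", "idea", "fear", "work"],
}
props = sentiment_proportions(words_by_group, TOY_NRC)
for group, shares in props.items():
    print(group, shares)
```

Because each group's shares sum to one, groups with very different numbers of talks can be compared on the same scale.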

To look further at what may have contributed to the max rating a TED Talk received from its viewers, I will look at the top ten words used in TED Talks as grouped by max rating. Again, the frequencies are presented as proportions to account for differences in sample size among groups.

In the plots below, we see many of the most common words from the word distribution plots above represented in many or all of the max rating categories. 11 out of the 14 groups have the word “well” as their most frequently used word. The words “work”, “great”, “good”, “thank”, and “problem” are other common words among all groups.

Some interesting findings shown by these plots are as follows. “Fear” and “death” are among the top 10 words used in talks with a max rating of courageous. “Cancer” is one of the top ten words used in talks with a max rating of informative. Talks with an obnoxious max rating had the most interesting set of top ten words used - “risk”, “complex”, “virus”, “unexpected”, “injury”, “died”. OK talks also had a unique list of top ten words, including “award”, “winner”, and “cave”.

Instead of looking at the distribution of sentiments provided by the nrc lexicon, I will now use the bing lexicon to look at how positive or negative TED Talks are based on their max rating.

Most of the talks appear to have a higher frequency of words that contribute to positive sentiment. The talks with max ratings “Courageous”, “Obnoxious”, and “OK” have more negative words than positive words. We will explore these talks further to see which words contribute to their sentiments in both the bing and nrc lexicons.
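A minimal sketch of the positive-versus-negative comparison follows. The two word sets are small made-up stand-ins for the bing lexicon, the word lists are invented, and polarity_counts is a hypothetical helper; the point is only to show how groups with more negative than positive words could be flagged.

```python
# Assumption: small made-up stand-ins for the bing positive and negative word lists.
TOY_POSITIVE = {"well", "great", "good", "thank", "work"}
TOY_NEGATIVE = {"problem", "death", "fear", "died", "risk"}

def polarity_counts(words):
    """Return (positive hits, negative hits) for a list of words."""
    pos = sum(w in TOY_POSITIVE for w in words)
    neg = sum(w in TOY_NEGATIVE for w in words)
    return pos, neg

# Made-up word lists per max rating group.
words_by_group = {
    "Inspiring": ["well", "great", "problem", "thank"],
    "Courageous": ["fear", "death", "good", "died"],
}

# Flag the groups whose negative count exceeds their positive count.
more_negative = []
for group, words in words_by_group.items():
    pos, neg = polarity_counts(words)
    if neg > pos:
        more_negative.append(group)
print(more_negative)
```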

For TED Talks with a max rating of “Courageous”, “Obnoxious”, or “OK”, I will present plots of the top 20 positive and negative words used in those talks.

When the max rating is “Courageous”, the top 20 negative words are associated with topics that are difficult to talk about: death, corruption, killed, lost, die, died, dead, pain, suicide, depression.

When the max rating is “Obnoxious”, the words or topics that may contribute to the higher negative sentiment are not obvious. We see an interesting group of words among the top 20 most used negative words: virus, neurotic, weird, untouched, epidemic, blunt.

When the max rating is “OK”, we see an even less cohesive group of words contributing to the negative sentiment. The top 20 negative words include silly, limited, funky, dizzy, died, destruction, cheap, blind, and betray. These words do have a negative connotation, but they are not obviously associated with topics that are difficult to talk about, as the words in the “Courageous” group are. This mismatched group of negative words makes sense for a talk rated “OK”: nothing too negative or too positive.


Blain’s Analysis

Which talks had the highest proportion of positive words?

Most Positive Talks:
Name | Word Count | Positive Words (n) | Proportion Positive (p)
Onora O’Neill: What we don’t understand about trust | 1380 | 103 | 0.0746377
Ron Gutman: The hidden power of smiling | 1006 | 75 | 0.0745527
Chade-Meng Tan: Everyday compassion at Google | 1731 | 123 | 0.0710572
Richard St. John: 8 secrets of success | 533 | 36 | 0.0675422
Richard St. John: Success is a continuous journey | 671 | 44 | 0.0655738
Jill Shargaa: Please, please, people. Let’s put the ‘awe’ back in ‘awesome’ | 845 | 53 | 0.0627219
Wanuri Kahiu: Fun, fierce and fantastical African art | 649 | 40 | 0.0616333
John Legend: “Redemption Song” | 792 | 48 | 0.0606061
Don Levy: A cinematic journey through visual effects | 567 | 33 | 0.0582011
Mandy Len Catron: A better way to talk about love | 2104 | 122 | 0.0579848
Larry Smith: Why you will fail to have a great career | 2019 | 117 | 0.0579495
Jane McGonigal: Massively multi-player… thumb-wrestling? | 1333 | 77 | 0.0577644
Denis Dutton: A Darwinian theory of beauty | 1901 | 109 | 0.0573382
David Steindl-Rast: Want to be happy? Be grateful | 1788 | 102 | 0.0570470
Einstein the Parrot: A talking, squawking parrot | 973 | 55 | 0.0565262
Jenna McCarthy: What you don’t know about marriage | 1597 | 90 | 0.0563557
Raul Midon: “Peace on Earth” | 599 | 33 | 0.0550918
Michelle Obama: A passionate, personal case for education | 1565 | 86 | 0.0549521
Martin Seligman: The new era of positive psychology | 3235 | 177 | 0.0547141
Dan Dennett: Cute, sexy, sweet, funny | 1049 | 57 | 0.0543375

To obtain this chart we filtered out presentations that had fewer than 510 words. This filter was chosen to help remove some artistic performances, which we did not want to include in the analysis. However, we see that “John Legend: Redemption Song” still made it through.
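The filter and the proportion in the table can be reproduced with simple arithmetic: the proportion is the positive word count divided by the talk's word count. The sketch below uses the first two rows of the table plus one hypothetical short talk that the 510-word filter should drop; the names MIN_WORDS and positive_proportion are illustrative.

```python
MIN_WORDS = 510  # threshold stated in the text

def positive_proportion(word_count, n_positive):
    """Share of a talk's words that are positive sentiment words."""
    return n_positive / word_count

# (name, word count, positive word count). The first two rows come from the table;
# the third is a hypothetical short performance that the filter should remove.
talks = [
    ("Onora O’Neill: What we don’t understand about trust", 1380, 103),
    ("Ron Gutman: The hidden power of smiling", 1006, 75),
    ("Hypothetical short performance", 200, 30),
]

# Keep talks with at least MIN_WORDS words, then rank by proportion, highest first.
kept = [(name, positive_proportion(wc, n)) for name, wc, n in talks if wc >= MIN_WORDS]
ranked = sorted(kept, key=lambda row: row[1], reverse=True)
for name, p in ranked:
    print(f"{name}: {p:.7f}")
```

Without the filter, the hypothetical short talk would rank first with a proportion of 0.15, which shows how very short transcripts can dominate a ranking based on proportions.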

Many of the talks featured in the above table are what we would expect. Most have a positive sentiment in their title (such as “The Hidden Power”). Some talks were expected to be positive because of the speaker’s position (for example, Michelle Obama makes it on the list).

Which words contributed the most often to sentiment scores?

We can see from the chart above that the word “problem” contributed the most toward negative sentiment scores and that the word “well” contributed the most toward positive scores. It is interesting that the word “cancer” appears more times in the talks than “happy.” However, this observation is not surprising considering the problem-solving nature of the TED conference.

Did sentiment change over time?

From the chart above, there does seem to be a trend of decreasing positivity in TED talks over time. Three out of four times, when culture is the top tag of the year, the proportion of sentiment words seems to sharply decrease. To investigate this further, we will look to see if there was a difference in sentiment between talks that had a “culture” tag and talks that had a “technology” tag. But first, let’s take another look at the proportion of sentiment words using a ridge plot:

The ridge plot highlights that the proportion of positive sentiment words is usually higher than the proportion of negative words. However, this difference seems to be decreasing over time. The gray areas where the ridgelines overlap represent the number of talks where there was actually a higher proportion of negative words.

Was there a difference in positive sentiment between talks that were tagged “culture” and talks that were tagged “technology”?

At 95% confidence, there does not seem to be a difference in the proportion of positive sentiment words between culture talks and technology talks. Word count on the X axis would not always be a valid way to compare these proportions, but in this case there is a similar inverse trend. We can also look at the difference in mean positive sentiment word proportion:

Culture Technology
0.5738392 0.6046312

So, the difference in mean positive sentiment word proportion between the two tags is about 3 percentage points (0.6046 − 0.5738 ≈ 0.031).
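One way to make a 95% confidence comparison of two group means concrete is an approximate confidence interval for their difference. The sketch below uses a normal approximation with unpooled variances; this is an assumption of the sketch, not necessarily the method behind the chart. The per-talk proportions are made up, and diff_ci_95 is a hypothetical helper.

```python
import math
import statistics

def diff_ci_95(a, b):
    """Approximate 95% CI for mean(b) - mean(a), using a normal approximation."""
    diff = statistics.mean(b) - statistics.mean(a)
    # Standard error of the difference with unpooled sample variances.
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return diff - 1.96 * se, diff + 1.96 * se

# Made-up per-talk positive sentiment proportions, not the real data.
culture = [0.50, 0.64, 0.55, 0.62, 0.56, 0.57]
technology = [0.63, 0.58, 0.62, 0.59, 0.61, 0.60]

low, high = diff_ci_95(culture, technology)
print(f"95% CI for the difference: ({low:.3f}, {high:.3f})")
# If the interval contains zero, the data do not show a clear difference.
print("no clear difference" if low <= 0 <= high else "difference detected")
```

With samples this small, a t-based interval would be more appropriate; the normal approximation is used here only to keep the sketch short.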

Was there a difference in the trend of positive sentiments between culture and technology talks?

There does appear to be a difference in trends between culture talks and technology talks: culture talks seem to have decreased in positivity, whereas positivity in technology talks has remained relatively constant.
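A simple way to put a number on such trends is the least-squares slope of the yearly mean positive proportion for each tag. The sketch below is an assumption about how this could be done, not the method behind the chart, and the yearly values are made up to mimic a falling series and a roughly flat one.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up yearly means of positive sentiment proportion, not the real data.
years = [2010, 2011, 2012, 2013, 2014]
culture_means = [0.62, 0.60, 0.58, 0.56, 0.54]  # falls steadily
tech_means = [0.60, 0.61, 0.60, 0.59, 0.60]     # roughly flat

print(f"culture slope per year: {slope(years, culture_means):.4f}")
print(f"technology slope per year: {slope(years, tech_means):.4f}")
```

A slope near zero corresponds to a flat trend, while a clearly negative slope corresponds to decreasing positivity.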